A password, sometimes called a passcode, is a memorized secret used to confirm the identity of a user. Despite growing awareness of the need for strong passwords to keep attackers from acquiring users’ sensitive information, many bad passwords are still in use worldwide.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
passwords <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
## Parsed with column specification:
## cols(
## rank = col_double(),
## password = col_character(),
## category = col_character(),
## value = col_double(),
## time_unit = col_character(),
## offline_crack_sec = col_double(),
## rank_alt = col_double(),
## strength = col_double(),
## font_size = col_double()
## )
passwords
## # A tibble: 507 x 9
## rank password category value time_unit offline_crack_s… rank_alt strength
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 password passwor… 6.91 years 2.17 1 8
## 2 2 123456 simple-… 18.5 minutes 0.0000111 2 4
## 3 3 12345678 simple-… 1.29 days 0.00111 3 4
## 4 4 1234 simple-… 11.1 seconds 0.000000111 4 4
## 5 5 qwerty simple-… 3.72 days 0.00321 5 8
## 6 6 12345 simple-… 1.85 minutes 0.00000111 6 4
## 7 7 dragon animal 3.72 days 0.00321 7 8
## 8 8 baseball sport 6.91 years 2.17 8 4
## 9 9 football sport 6.91 years 2.17 9 7
## 10 10 letmein passwor… 3.19 months 0.0835 10 8
## # … with 497 more rows, and 1 more variable: font_size <dbl>
Take note: below are the definitions of each column.
rank: Popularity in their database of released passwords
password: Actual text of the password
category: What category does the password fall into?
value: Time to crack by online guessing
time_unit: Time unit to match with value
offline_crack_sec: Time to crack offline in seconds
rank_alt: Rank 2
strength: Strength = quality of password where 10 is highest, 1 is lowest; please note that these are relative to these generally bad passwords
font_size: Used to create the graphic for KIB
str(passwords)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 507 obs. of 9 variables:
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ password : chr "password" "123456" "12345678" "1234" ...
## $ category : chr "password-related" "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
## $ value : num 6.91 18.52 1.29 11.11 3.72 ...
## $ time_unit : chr "years" "minutes" "days" "seconds" ...
## $ offline_crack_sec: num 2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
## $ rank_alt : num 1 2 3 4 5 6 7 8 9 10 ...
## $ strength : num 8 4 4 4 8 4 8 4 7 8 ...
## $ font_size : num 11 8 8 8 11 8 11 8 11 11 ...
## - attr(*, "spec")=
## .. cols(
## .. rank = col_double(),
## .. password = col_character(),
## .. category = col_character(),
## .. value = col_double(),
## .. time_unit = col_character(),
## .. offline_crack_sec = col_double(),
## .. rank_alt = col_double(),
## .. strength = col_double(),
## .. font_size = col_double()
## .. )
passwords$category <- as.factor(passwords$category)
class(passwords$category)
## [1] "factor"
passwords$time_unit <- as.factor(passwords$time_unit)
class(passwords$time_unit)
## [1] "factor"
passwords %>%
is.na() %>%
colSums()
## rank password category value
## 7 7 7 7
## time_unit offline_crack_sec rank_alt strength
## 7 7 7 7
## font_size
## 7
As a rule of thumb, rows with missing values can be dropped when they make up less than 5% of the data. Here every column is missing in the same 7 of 507 rows (about 1.4%), so we can delete those rows.
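The 5% check can be made explicit before anything is dropped. The sketch below is only an illustration: `toy` is a made-up stand-in for the passwords table (not the real data), and base R's `complete.cases()` is used to compute the share of rows that contain at least one missing value.

```r
# Hypothetical stand-in for the passwords table: 1 of 40 rows is entirely NA
toy <- data.frame(
  rank     = c(1:39, NA),
  strength = c(rep(c(4, 8, 7), 13), NA)
)

# Share of rows with at least one missing value
incomplete_share <- mean(!complete.cases(toy))

# Apply the 5% rule of thumb before deciding to drop rows
if (incomplete_share < 0.05) {
  toy_clean <- toy[complete.cases(toy), ]
} else {
  toy_clean <- toy  # too many incomplete rows to drop blindly; keep them for now
}

# The same ratio for the real data, using the counts printed above (7 of 507 rows)
real_share <- 7 / 507
```

On the real data the equivalent check would be `mean(!complete.cases(passwords))`, which should agree with `real_share` as long as the 7 missing values in each column sit in the same rows.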
passwords_new <- passwords %>%
drop_na(rank, password, category, value, time_unit, offline_crack_sec, rank_alt, strength, font_size)
Check the number of NA once again.
passwords_new %>%
is.na() %>%
colSums()
## rank password category value
## 0 0 0 0
## time_unit offline_crack_sec rank_alt strength
## 0 0 0 0
## font_size
## 0
str(passwords_new)
## Classes 'tbl_df', 'tbl' and 'data.frame': 500 obs. of 9 variables:
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ password : chr "password" "123456" "12345678" "1234" ...
## $ category : Factor w/ 10 levels "animal","cool-macho",..: 7 9 9 9 9 9 1 10 10 7 ...
## $ value : num 6.91 18.52 1.29 11.11 3.72 ...
## $ time_unit : Factor w/ 7 levels "days","hours",..: 7 3 1 5 1 3 1 7 7 4 ...
## $ offline_crack_sec: num 2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
## $ rank_alt : num 1 2 3 4 5 6 7 8 9 10 ...
## $ strength : num 8 4 4 4 8 4 8 4 7 8 ...
## $ font_size : num 11 8 8 8 11 8 11 8 11 11 ...
summary(passwords_new)
## rank password category value
## Min. : 1.0 Length:500 name :183 Min. : 1.290
## 1st Qu.:125.8 Class :character cool-macho : 79 1st Qu.: 3.430
## Median :250.5 Mode :character simple-alphanumeric: 61 Median : 3.720
## Mean :250.5 fluffy : 44 Mean : 5.603
## 3rd Qu.:375.2 sport : 37 3rd Qu.: 3.720
## Max. :500.0 nerdy-pop : 30 Max. :92.270
## (Other) : 66
## time_unit offline_crack_sec rank_alt strength
## days :238 Min. : 0.00000 Min. : 1.0 Min. : 0.000
## hours : 43 1st Qu.: 0.00321 1st Qu.:125.8 1st Qu.: 6.000
## minutes: 51 Median : 0.00321 Median :251.5 Median : 7.000
## months : 87 Mean : 0.50001 Mean :251.2 Mean : 7.432
## seconds: 11 3rd Qu.: 0.08350 3rd Qu.:376.2 3rd Qu.: 8.000
## weeks : 5 Max. :29.27000 Max. :502.0 Max. :48.000
## years : 65
## font_size
## Min. : 0.0
## 1st Qu.:10.0
## Median :11.0
## Mean :10.3
## 3rd Qu.:11.0
## Max. :28.0
##
From the data above, we can conclude a few things:
1. There are 10 categories of bad passwords used by people.
2. Of these categories, name is the most frequently used (183 of the 500 passwords).
3. The mean strength of these bad passwords (bearing in mind that they are all bad passwords) is 7.432 and the median is 7. The scale is described as running from 1 (lowest) to 10 (highest), although the observed values range from 0 to 48.
4. The mean of value, the time to crack these passwords by online guessing, is roughly 5.603 and the median is 3.720. Reading these as days (the most common time unit, 238 of 500 rows) gives only a rough indication, because the column mixes units from seconds to years.
5. The mean time to crack these passwords offline is 0.5 seconds, while the median is 0.00321 seconds.
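Because `value` is recorded in whichever unit `time_unit` names, the mean in point 4 averages seconds, days and years together. One way to make the figures comparable is to convert everything to seconds first. The sketch below does this for a few rows copied from the printed table. The `secs_per_unit` lookup is an assumption of this example, not something defined by the dataset: in particular, a month is taken as 30 days and a year as 365 days.

```r
# A few rows copied from the printed tibble above
pw <- data.frame(
  password  = c("password", "123456", "12345678", "1234", "letmein"),
  value     = c(6.91, 18.5, 1.29, 11.1, 3.19),
  time_unit = c("years", "minutes", "days", "seconds", "months"),
  stringsAsFactors = FALSE
)

# Assumed conversion factors (month = 30 days, year = 365 days)
secs_per_unit <- c(
  seconds = 1,
  minutes = 60,
  hours   = 60 * 60,
  days    = 24 * 60 * 60,
  weeks   = 7 * 24 * 60 * 60,
  months  = 30 * 24 * 60 * 60,
  years   = 365 * 24 * 60 * 60
)

# as.character() matters once time_unit is a factor: indexing a named vector
# with a factor would use the integer codes instead of the unit names
pw$online_crack_sec <- pw$value * unname(secs_per_unit[as.character(pw$time_unit)])

# Summaries on a common scale
mean_online_sec   <- mean(pw$online_crack_sec)
median_online_sec <- median(pw$online_crack_sec)
```

The same two lines applied to `passwords_new` would give a mean and median of the online cracking time that can be stated in a single unit.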
Next, we are going to examine the relationship between the strength of these passwords and the time needed to crack them offline. The strength score is treated here as a measure of resistance to cracking by online guessing, so it is worth checking whether it also tracks the offline cracking time.
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot1 <- ggplot(data = passwords_new, mapping = aes(x = strength, y = offline_crack_sec)) +
geom_jitter(aes(color = category)) +
geom_smooth(method = "auto") +
labs(x = "Strength", y = "Time to Crack Offline in Seconds", title = "Time to crack offline in seconds vs Strength of Passwords based on Online Guessing") +
theme_minimal() +
theme(legend.position = "none")
ggplotly(plot1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the plot above, there appears to be a weak positive correlation between the time to crack these passwords offline and their strength based on online guessing. Note, however, that there are extreme outliers that take very little time to crack offline yet receive a high strength score.
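A rank correlation is one way to put a number on this visual impression. The sketch below applies base R's `cor()` with `method = "spearman"` to the ten rows shown in the first printed table. It is only an illustration on that small subset, so the resulting coefficient should not be read as the value for all 500 passwords.

```r
# The first ten rows of the printed tibble (strength and offline cracking time)
first_ten <- data.frame(
  strength = c(8, 4, 4, 4, 8, 4, 8, 4, 7, 8),
  offline_crack_sec = c(2.17, 0.0000111, 0.00111, 0.000000111, 0.00321,
                        0.00000111, 0.00321, 2.17, 2.17, 0.0835)
)

# Spearman's rho works on ranks, so the extreme values seen in the plot
# influence it less than they would influence a Pearson correlation
rho <- cor(first_ten$strength, first_ten$offline_crack_sec, method = "spearman")
rho
```

For the full dataset the call would be `cor(passwords_new$strength, passwords_new$offline_crack_sec, method = "spearman")`. Spearman is chosen here because the plot shows outliers in both variables.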
Next, we look at the strength of these passwords by category to see which types of password are more easily cracked by computers.
plot2 <- ggplot(data = passwords_new, mapping = aes(x = category, y = strength)) +
geom_boxplot(aes(fill = category)) +
labs(x = "Category of passwords", y = "Strength", title = "Strength of Passwords based on their Categories") +
theme_minimal() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()
ggplotly(plot2)
Based on this box plot alone, we can tell that passwords classified as sport, nerdy-pop, name and cool-macho have the highest median strength of 8, while simple-alphanumeric passwords have the lowest median strength of 4.
Simple-alphanumeric and nerdy-pop passwords have the most outliers (5 each); however, the nerdy-pop outliers tend to have higher strength than the simple-alphanumeric ones.
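The medians and outlier counts read off the box plot can also be computed directly. The sketch below is a minimal illustration on made-up data (`toy` and its two categories are hypothetical). It uses base R's `boxplot.stats()`, which flags points more than 1.5 times the hinge spread beyond the hinges; `geom_boxplot()` uses quartiles rather than hinges, so counts on real data may differ slightly from the plot.

```r
# Hypothetical data: category "a" contains one extreme value, "b" contains none
toy <- data.frame(
  category = c(rep("a", 5), rep("b", 5)),
  strength = c(3, 4, 4, 5, 20, 6, 7, 8, 9, 10),
  stringsAsFactors = FALSE
)

# Median strength per category
median_by_cat <- tapply(toy$strength, toy$category, median)

# Number of box-plot outliers per category
outliers_by_cat <- sapply(
  split(toy$strength, toy$category),
  function(x) length(boxplot.stats(x)$out)
)
```

Replacing `toy` with `passwords_new` would give the per-category figures quoted above without reading them off the interactive plot.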
Now, let’s compare the rank of these passwords based on their categories.
plot3 <- ggplot(data = passwords_new, mapping = aes(x = category, y = rank)) +
geom_boxplot(aes(fill = category)) +
labs(x = "Category of passwords", y = "Rank", title = "Popularity of Passwords based on their Categories") +
theme_minimal() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()
ggplotly(plot3)
Let’s try to combine the conclusions of these two graphs to gain a more meaningful insight.
Taken together, the two boxplots show that although cool-macho passwords are among the strongest types of password, they are not the type people use the most. Similarly, name and nerdy-pop passwords do not fare well in terms of popularity and usage despite being among the strongest against cracking by computer.
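Putting the two summaries side by side in one table makes this comparison easier to check. The sketch below uses base R's `aggregate()` on a small made-up data frame (`toy` and its values are hypothetical). Keep in mind that a lower rank number means a more popular password.

```r
# Hypothetical data: "name" is stronger but less popular than "simple"
toy <- data.frame(
  category = c("name", "name", "name", "simple", "simple", "simple"),
  strength = c(8, 8, 9, 4, 4, 0),
  rank     = c(100, 200, 300, 2, 3, 19),
  stringsAsFactors = FALSE
)

# Median strength and median rank for each category in one table
by_cat <- aggregate(cbind(strength, rank) ~ category, data = toy, FUN = median)

# Strongest categories first
by_cat <- by_cat[order(-by_cat$strength), ]
by_cat
```

With `data = passwords_new` the same call would produce one row per category, so strength and popularity can be compared directly.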
Next, let’s compare the categories by the time needed to crack them offline.
plot4 <- ggplot(data = passwords_new, mapping = aes(x = category, y = offline_crack_sec)) +
geom_boxplot(aes(fill = category)) +
# Note: seq(0, 0.005) uses the default step of 1, so it yields the single break 0
scale_y_continuous(breaks = seq(0,0.005), limit = NA) +
labs(x = "Category of passwords", y = "Time to crack offline in seconds", title = "Time to crack these passwords offline based on their Categories") +
theme_minimal() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()
ggplotly(plot4)
From this, we can only find out that the median time required to crack nerdy-pop passwords offline is the highest, at 0.04s.
This suggests that nerdy-pop passwords are the hardest to crack, both offline and online.
This conclusion is particularly useful to users when deciding which type of passwords to use in order to maximise their safety against hackers.
Next, we want to find the most used passwords that are also the weakest against both offline and online cracking, so that we know which particular passwords to avoid entirely.
passrank <- passwords_new %>%
filter(strength == 0) %>%
filter(rank < 100)
passrank
## # A tibble: 5 x 9
## rank password category value time_unit offline_crack_s… rank_alt strength
## <dbl> <chr> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 19 111111 simple-… 18.5 minutes 0.0000111 19 0
## 2 20 2000 simple-… 11.1 seconds 0.000000111 20 0
## 3 46 pepper food 3.72 days 0.00321 46 0
## 4 60 666666 simple-… 18.5 minutes 0.0000111 60 0
## 5 77 1111 simple-… 11.1 seconds 0.000000111 77 0
## # … with 1 more variable: font_size <dbl>
library(ggrepel)
plot5 <- ggplot(data = passwords_new, mapping = aes(x = password, y = rank)) +
geom_jitter(aes(colour = strength)) +
geom_label_repel(data = passrank, aes(label = password), size = 2) +
facet_wrap(~strength) +
labs(x = NULL, y = "Rank", title = "Passwords based on rank and strength") +
theme_minimal() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_blank())
plot5
Based on this, we highlight the passwords 2000, 111111, pepper, 1111 and 666666, which have a rank below 100 and a strength of 0. Now, let’s see whether these passwords stand out again when we look at the time needed to crack them offline.
passtime <- passwords_new %>%
filter(offline_crack_sec < median(offline_crack_sec)) %>%
filter(rank < 100)
passtime
## # A tibble: 19 x 9
## rank password category value time_unit offline_crack_s… rank_alt strength
## <dbl> <chr> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 2 123456 simple-… 18.5 minutes 0.0000111 2 4
## 2 3 12345678 simple-… 1.29 days 0.00111 3 4
## 3 4 1234 simple-… 11.1 seconds 0.000000111 4 4
## 4 6 12345 simple-… 1.85 minutes 0.00000111 6 4
## 5 12 696969 simple-… 18.5 minutes 0.0000111 12 1
## 6 19 111111 simple-… 18.5 minutes 0.0000111 19 0
## 7 20 2000 simple-… 11.1 seconds 0.000000111 20 0
## 8 24 1234567 simple-… 3.09 hours 0.000111 24 4
## 9 34 test passwor… 7.92 minutes 0.00000475 34 4
## 10 35 pass passwor… 7.92 minutes 0.00000475 35 3
## 11 42 love fluffy 7.92 minutes 0.00000475 42 6
## 12 45 6969 simple-… 11.1 seconds 0.000000111 45 4
## 13 50 654321 simple-… 18.5 minutes 0.0000111 50 4
## 14 58 123123 simple-… 18.5 minutes 0.0000111 58 7
## 15 60 666666 simple-… 18.5 minutes 0.0000111 60 0
## 16 61 hello simple-… 3.43 hours 0.000124 61 4
## 17 67 sexy cool-ma… 7.92 minutes 0.00000475 67 6
## 18 77 1111 simple-… 11.1 seconds 0.000000111 77 0
## 19 80 121212 simple-… 18.5 minutes 0.0000111 80 1
## # … with 1 more variable: font_size <dbl>
plot_1 <- ggplot(data = passtime, mapping = aes(x = reorder(password,-offline_crack_sec), y = offline_crack_sec)) +
geom_col(aes(fill = offline_crack_sec)) +
scale_fill_viridis_c() +
labs(y = "Time to crack offline in seconds", x = "Passwords", title = "Time to crack passwords offline in seconds") +
coord_flip()
ggplotly(plot_1)
From this, the worst passwords are 6969, 2000, 1234 and 1111. Combining the two results above, we can conclude that 2000 and 1111 are among the weakest passwords (against both online and offline cracking) yet are widely used. However, the crown has to go to 2000, as it is far more popular (rank 20) than 1111 (rank 77).
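The overlap between the two filtered tables can be worked out in code rather than by eye. The sketch below copies the passwords and offline cracking times from the printed `passrank` and `passtime` tables into small vectors (the `_pw` and `_tbl` names are just labels for this example) and uses base R's `intersect()` to find the passwords that appear in both.

```r
# Passwords from the printed passrank table (strength 0, rank below 100)
passrank_pw <- c("111111", "2000", "pepper", "666666", "1111")

# Passwords and offline cracking times from the printed passtime table
passtime_tbl <- data.frame(
  password = c("123456", "12345678", "1234", "12345", "696969", "111111",
               "2000", "1234567", "test", "pass", "love", "6969", "654321",
               "123123", "666666", "hello", "sexy", "1111", "121212"),
  offline_crack_sec = c(0.0000111, 0.00111, 0.000000111, 0.00000111, 0.0000111,
                        0.0000111, 0.000000111, 0.000111, 0.00000475,
                        0.00000475, 0.00000475, 0.000000111, 0.0000111,
                        0.0000111, 0.0000111, 0.000124, 0.00000475,
                        0.000000111, 0.0000111),
  stringsAsFactors = FALSE
)

# Passwords that are weak on both criteria
both <- intersect(passrank_pw, passtime_tbl$password)

# Of those, keep the ones with the shortest offline cracking time
weakest <- passtime_tbl[passtime_tbl$password %in% both, ]
weakest <- weakest[weakest$offline_crack_sec == min(weakest$offline_crack_sec), ]
weakest$password
```

Four passwords appear in both tables (111111, 2000, 666666 and 1111); pepper drops out because its offline cracking time is not below the median. Among the four, 2000 and 1111 have the shortest offline cracking time.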
There may be a logical explanation for this: many people use a date that is significant in their lives (birth date, wedding date, etc.) as their password, and 2000 is a year that is difficult to forget. What sets 2000 apart from other years, however, is that its three zeroes are easy to guess, for people and computers alike, as people are known to use 0 and 1 the most in their passwords.
passstrong <- passwords_new %>%
filter(strength > 1.3*median(strength)) %>%
filter(rank < 100)
passstrong
## # A tibble: 4 x 9
## rank password category value time_unit offline_crack_s… rank_alt strength
## <dbl> <chr> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 13 abc123 simple-… 3.7 weeks 0.0224 13 32
## 2 22 superman name 6.91 years 2.17 22 10
## 3 26 trustno1 simple-… 92.3 years 29.0 26 25
## 4 66 computer nerdy-p… 6.91 years 2.17 66 10
## # … with 1 more variable: font_size <dbl>
library(ggrepel)
plot7 <- ggplot(data = passwords_new, mapping = aes(x = password, y = rank)) +
geom_jitter(aes(colour = strength)) +
geom_label_repel(data = passstrong, aes(label = password), size = 2) +
facet_wrap(~strength) +
labs(x = NULL, y = "Rank", title = "Passwords based on rank and strength") +
theme_minimal() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_blank())
plot7
Based on this, we highlight the few passwords that are ranked below 100 yet have a strength of more than 1.3 times the median strength of all the passwords.
Let’s see whether these passwords are tough to crack offline as well.
passhigh <- passwords_new %>%
filter(offline_crack_sec > 100*median(offline_crack_sec)) %>%
filter(rank < 100)
passhigh
## # A tibble: 13 x 9
## rank password category value time_unit offline_crack_s… rank_alt strength
## <dbl> <chr> <fct> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 1 password passwor… 6.91 years 2.17 1 8
## 2 8 baseball sport 6.91 years 2.17 8 4
## 3 9 football sport 6.91 years 2.17 9 7
## 4 18 jennifer name 6.91 years 2.17 18 9
## 5 22 superman name 6.91 years 2.17 22 10
## 6 26 trustno1 simple-… 92.3 years 29.0 26 25
## 7 41 michelle name 6.91 years 2.17 41 8
## 8 43 sunshine fluffy 6.91 years 2.17 43 9
## 9 53 starwars nerdy-p… 6.91 years 2.17 53 8
## 10 66 computer nerdy-p… 6.91 years 2.17 66 10
## 11 74 corvette cool-ma… 6.91 years 2.17 74 8
## 12 83 princess fluffy 6.91 years 2.17 83 8
## 13 99 iloveyou fluffy 6.91 years 2.17 99 9
## # … with 1 more variable: font_size <dbl>
plot_2 <- ggplot(data = passhigh, mapping = aes(x = reorder(password, offline_crack_sec), y = offline_crack_sec)) +
geom_col(aes(fill = offline_crack_sec)) +
scale_fill_viridis_c() +
labs(y = "Time to crack offline in seconds", x = "Passwords", title = "Time to crack passwords offline in seconds") +
coord_flip()
ggplotly(plot_2)
From these plots, we find that the password trustno1 is among the 100 most popular passwords, yet it is the hardest of them to crack offline and one of the strongest against online guessing. A likely reason is that it combines letters and numbers, which makes it more difficult to crack. On the other hand, the reason it is popular (rank 26) is probably that the phrase itself is easy to remember.
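The effect of mixing letters and numbers can be illustrated with a little arithmetic. The sketch below assumes that an attacker brute-forces every combination of a fixed-length alphabet; that model is an assumption of this example, not something stated by the dataset, and `keyspace()` is a helper defined here for illustration only.

```r
# Number of candidate passwords with n_chars characters drawn from an
# alphabet of alphabet_size symbols (brute-force assumption)
keyspace <- function(alphabet_size, n_chars) alphabet_size^n_chars

letters_only   <- keyspace(26, 8)       # 8 lowercase letters, e.g. "password"
letters_digits <- keyspace(26 + 10, 8)  # 8 lowercase letters or digits, e.g. "trustno1"

# How much larger the search space becomes when digits are allowed
size_ratio <- letters_digits / letters_only

# Ratio of the offline cracking times reported above:
# 29.27 s (maximum in the summary, trustno1) against 2.17 s (e.g. password)
time_ratio <- 29.27 / 2.17

c(size_ratio = size_ratio, time_ratio = time_ratio)
```

Both ratios come out at roughly 13.5, which is consistent with the idea that trustno1 takes longer to crack offline mainly because digits enlarge the set of candidates, not because the phrase itself is unusual.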
In this report, I have analysed various kinds of bad passwords as well as their popularity, strength and category. However, I would caution against using any of these passwords, as they are easy to crack. It is therefore advisable to use high-strength passwords to ward off hackers and protect crucial information and data stored on our devices and cards.
Lastly, I would like to end this with a comic regarding passwords that may be useful when selecting which password to use in the future.